feat(telemetry): integrate telemetry tracking across interactive and live view components #1798
Pull request overview
This PR introduces an opt-out telemetry system and wires event tracking through key CLI/UI, soul, background-task, hooks, and auth flows to enable anonymous usage analytics and reliability diagnostics.
Changes:
- Added telemetry core: a module-level `track()` with pre-sink buffering, an `EventSink` for enrichment/batching/periodic flush, and an `AsyncTransport` with HTTP sending + disk fallback + startup retry.
- Instrumented many user interactions and runtime events (slash commands, shortcuts, cancel/interrupt, OAuth refresh, hooks, tool approval/errors, background tasks, MCP connection, etc.).
- Added a config flag (`telemetry: bool`) and a new test suite covering telemetry behavior and (nominal) instrumentation expectations.
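The pre-sink buffering mentioned above can be sketched roughly as follows. This is a simplified model, not the PR's implementation: the names `track`/`attach_sink` mirror the PR's public API, but the sink's `handle` method and the record shape are assumptions (the real `EventSink` also does enrichment, batching, and periodic flushing).

```python
# Minimal sketch of module-level track() buffering: events queue up
# until a sink is attached, then the queue is drained into the sink.
_buffer: list[dict] = []
_sink = None


def track(event: str, **properties) -> None:
    record = {"event": event, "properties": properties}
    if _sink is None:
        _buffer.append(record)  # no sink yet: hold events in memory
    else:
        _sink.handle(record)  # `handle` is an assumed sink method


def attach_sink(sink) -> None:
    global _sink
    _sink = sink
    for record in _buffer:  # replay everything buffered pre-attach
        sink.handle(record)
    _buffer.clear()
```

This shape lets early startup code call `track()` before the transport/sink are configured without losing events.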
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/telemetry/test_telemetry.py | Adds unit tests for telemetry queueing, sink enrichment/flush, transport disk fallback/retry behaviors. |
| tests/telemetry/test_instrumentation.py | Adds tests intended to validate emitted event names/properties for production instrumentation paths. |
| tests/telemetry/__init__.py | Introduces telemetry test package marker. |
| tests/core/test_config.py | Updates default config dump expectations to include telemetry: True. |
| src/kimi_cli/web/runner/worker.py | Passes ui_mode="wire" into KimiCLI.create() for telemetry context in web worker runs. |
| src/kimi_cli/ui/shell/visualize/_live_view.py | Tracks expand shortcut, question answered/dismissed methods, and run cancel events. |
| src/kimi_cli/ui/shell/visualize/_interactive.py | Tracks input queueing and immediate steer actions. |
| src/kimi_cli/ui/shell/slash.py | Tracks various slash-command actions (model/thinking/theme/session ops/web/vis/undo/fork/etc.). |
| src/kimi_cli/ui/shell/prompt.py | Tracks prompt shortcuts (mode switch, plan toggle, newline, editor, paste). |
| src/kimi_cli/ui/shell/oauth.py | Tracks login/logout events. |
| src/kimi_cli/ui/shell/export_import.py | Tracks export/import actions. |
| src/kimi_cli/ui/shell/__init__.py | Starts/stops periodic telemetry flush + retry on startup; tracks input classifications, command invalid, exit duration, interrupt step, update prompt. |
| src/kimi_cli/telemetry/transport.py | Implements async HTTP transport with token/anonymous retry logic plus JSONL disk fallback and retry-on-start. |
| src/kimi_cli/telemetry/sink.py | Implements buffering, context enrichment, periodic flushing, and sync flush-to-disk for exit. |
| src/kimi_cli/telemetry/__init__.py | Implements public telemetry API (track, set_context, attach_sink, disable, flush_sync) with pre-sink buffering and atexit flush. |
| src/kimi_cli/subagents/runner.py | Tracks subagent creation. |
| src/kimi_cli/soul/toolset.py | Tracks tool execution errors with tool name. |
| src/kimi_cli/soul/slash.py | Tracks init completion and YOLO toggle state changes. |
| src/kimi_cli/soul/kimisoul.py | Tracks skill/flow invocation, MCP connection outcomes, and API error classification. |
| src/kimi_cli/soul/approval.py | Tracks tool approval/rejection outcomes (including cancel path). |
| src/kimi_cli/hooks/engine.py | Tracks hook trigger outcomes (allow/block) with event type. |
| src/kimi_cli/config.py | Adds telemetry boolean config option (default enabled). |
| src/kimi_cli/cli/__init__.py | Passes ui_mode down to KimiCLI.create() for consistent telemetry context. |
| src/kimi_cli/background/manager.py | Tracks background task creation and completion outcomes with duration/success. |
| src/kimi_cli/auth/oauth.py | Tracks first-launch device ID creation and OAuth refresh success/failure; adds cached token accessor for transport. |
| src/kimi_cli/app.py | Initializes telemetry based on config/env, attaches sink with context, and tracks startup events/perf. |
| src/kimi_cli/acp/server.py | Passes ui_mode="acp" into KimiCLI.create() for telemetry context in ACP server mode. |
```python
if retry_resp.status >= 500:
    raise _TransientError(f"HTTP {retry_resp.status}")
elif retry_resp.status >= 400:
    # Client error (4xx) — not recoverable, don't retry
    logger.debug(
        "Anonymous retry got client error HTTP {status}, dropping",
        status=retry_resp.status,
    )
    return
return
elif resp.status >= 400:
    raise _TransientError(f"HTTP {resp.status}")
```
AsyncTransport._send_http currently raises _TransientError for any HTTP status >= 400 (except the anonymous-retry 4xx early-return). This means permanent client errors like 400/403 will be treated as transient and will repeatedly fall back to disk + retry until expiry, potentially generating lots of on-disk telemetry files without any chance of recovery. Consider classifying 4xx (except possibly 429) as non-transient and dropping them (or logging once), reserving _TransientError for timeouts/connection issues and 5xx.
Suggested change:

```diff
-if retry_resp.status >= 500:
-    raise _TransientError(f"HTTP {retry_resp.status}")
-elif retry_resp.status >= 400:
-    # Client error (4xx) — not recoverable, don't retry
-    logger.debug(
-        "Anonymous retry got client error HTTP {status}, dropping",
-        status=retry_resp.status,
-    )
-    return
-return
-elif resp.status >= 400:
-    raise _TransientError(f"HTTP {resp.status}")
+if retry_resp.status >= 500 or retry_resp.status == 429:
+    raise _TransientError(f"HTTP {retry_resp.status}")
+elif retry_resp.status >= 400:
+    # Client error (4xx, except 429) — not recoverable, don't retry
+    logger.debug(
+        "Anonymous retry got client error HTTP {status}, dropping",
+        status=retry_resp.status,
+    )
+    return
+return
+elif resp.status >= 500 or resp.status == 429:
+    raise _TransientError(f"HTTP {resp.status}")
+elif resp.status >= 400:
+    # Client error (4xx, except 429) — not recoverable, don't retry
+    logger.debug(
+        "Telemetry got client error HTTP {status}, dropping",
+        status=resp.status,
+    )
+    return
```
```python
try:
    path = _telemetry_dir() / f"failed_{uuid.uuid4().hex[:12]}.jsonl"
    with open(path, "a", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event, ensure_ascii=False, separators=(",", ":")))
            f.write("\n")
    logger.debug(
```
Telemetry fallback files are written under ~/.kimi (via get_share_dir()) with default mkdir/open permissions, which may be group/world-readable depending on umask. Since these JSONL files include device_id/session_id and environment context, it would be safer to explicitly restrict permissions (e.g., chmod 0o700 on the telemetry dir and 0o600 on failed_*.jsonl files, similar to auth/oauth.py’s _ensure_private_file).
```python
transport = AsyncTransport(endpoint="https://mock.test/events")

call_count = 0

async def mock_send_http(payload: dict) -> None:
    nonlocal call_count
    call_count += 1
    if call_count == 1:
        # Simulate: first call succeeds (no _TransientError), events are "sent"
        # But we need to test the internal 4xx path. We'll test via send() + save_to_disk.
        pass

# More direct: patch _send_http to NOT raise (simulating 4xx handled internally)
# The 4xx path returns without raising, so send() should not call save_to_disk.
with (
    patch.object(transport, "_send_http", new_callable=AsyncMock),
```
test_anonymous_retry_4xx_drops_events doesn’t currently exercise the “anonymous retry returned 4xx” branch in AsyncTransport._send_http. It patches _send_http itself (bypassing all status-handling logic) and also defines mock_send_http/call_count that are unused. To validate the behavior, mock the aiohttp session.post responses so the first request returns 401 with a token and the anonymous retry returns a 4xx, then assert send() does not call save_to_disk().
Suggested change:

```diff
-transport = AsyncTransport(endpoint="https://mock.test/events")
-call_count = 0
-async def mock_send_http(payload: dict) -> None:
-    nonlocal call_count
-    call_count += 1
-    if call_count == 1:
-        # Simulate: first call succeeds (no _TransientError), events are "sent"
-        # But we need to test the internal 4xx path. We'll test via send() + save_to_disk.
-        pass
-# More direct: patch _send_http to NOT raise (simulating 4xx handled internally)
-# The 4xx path returns without raising, so send() should not call save_to_disk.
-with (
-    patch.object(transport, "_send_http", new_callable=AsyncMock),
+transport = AsyncTransport(
+    get_access_token=lambda: "test-token",
+    endpoint="https://mock.test/events",
+)
+first_resp = MagicMock()
+first_resp.status = 401
+first_resp.__aenter__ = AsyncMock(return_value=first_resp)
+first_resp.__aexit__ = AsyncMock(return_value=False)
+second_resp = MagicMock()
+second_resp.status = 400
+second_resp.__aenter__ = AsyncMock(return_value=second_resp)
+second_resp.__aexit__ = AsyncMock(return_value=False)
+mock_session = MagicMock()
+mock_session.post.side_effect = [first_resp, second_resp]
+mock_session.__aenter__ = AsyncMock(return_value=mock_session)
+mock_session.__aexit__ = AsyncMock(return_value=False)
+with (
+    patch("kimi_cli.utils.aiohttp.new_client_session", return_value=mock_session),
```
| """Tests for telemetry instrumentation correctness. | ||
|
|
||
| These tests verify that track() calls in production code emit the correct events | ||
| with the correct properties under the correct conditions. They do NOT test the | ||
| telemetry infrastructure itself (that's in test_telemetry.py). | ||
|
|
||
| Strategy: patch `kimi_cli.telemetry.track` at the call site and assert the | ||
| event name / properties, rather than running the full UI/soul stack. |
The module docstring says the strategy is to “patch kimi_cli.telemetry.track at the call site and assert … rather than running the full UI/soul stack”, but most tests below call track() directly (which will still pass even if the production instrumentation is removed). Either update the docstring to match the intended scope (event schema tests), or change the tests to patch/execute the actual production functions that should emit these events so they truly validate instrumentation correctness.
| """Tests for telemetry instrumentation correctness. | |
| These tests verify that track() calls in production code emit the correct events | |
| with the correct properties under the correct conditions. They do NOT test the | |
| telemetry infrastructure itself (that's in test_telemetry.py). | |
| Strategy: patch `kimi_cli.telemetry.track` at the call site and assert the | |
| event name / properties, rather than running the full UI/soul stack. | |
| """Tests for telemetry event behavior and schema correctness. | |
| These tests exercise the telemetry API directly and verify that calls to | |
| `track()` and related helpers produce the expected event names, properties, | |
| queue entries, and sink-forwarded payloads under the correct conditions. | |
| They do NOT verify that specific production UI/soul call sites are still | |
| instrumented, and they do NOT test the transport/infrastructure layer itself | |
| (that coverage belongs in test_telemetry.py). |
Keep main's `work_dir` validation check and our telemetry `track()` call. Track `kimi_session_resume` only after validation passes to avoid false positives.
…n state, and tool execution errors
- Updated OAuthManager to include specific reasons for failed token refresh attempts in telemetry tracking.
- Introduced a new function `classify_api_error` in kimisoul.py to classify API errors and streamline error handling.
- Refactored KimiSoul to utilize the new error classification function for telemetry tracking.
- Enhanced toolset error tracking to include error types in telemetry events.
- Added client information tracking in telemetry context for better analytics.
- Implemented retry logic in AsyncTransport for sending telemetry events, with fallback to disk on failure.
- Added comprehensive tests for new telemetry features, including client info and compaction tracking.
- Ensured that telemetry events include relevant properties for better debugging and analysis.
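The retry-then-fall-back-to-disk behavior described in the commit can be modeled in a few lines. This is an illustrative sketch only: `send_with_fallback`, `send_http`, and `save_to_disk` are stand-in names, not the PR's actual `AsyncTransport` methods, and real backoff/expiry logic is elided.

```python
import asyncio


async def send_with_fallback(send_http, save_to_disk, payload, retries=3):
    """Try HTTP delivery a few times; persist to disk if all attempts fail."""
    for _ in range(retries):
        try:
            await send_http(payload)
            return True  # delivered over HTTP
        except Exception:
            continue  # transient failure: try again (backoff elided)
    save_to_disk(payload)  # exhausted retries: persist for next startup
    return False
```

On the next startup, the persisted payloads would be re-read and resent, which is the "retry-on-start" half of the design.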
```python
set_context(device_id=get_device_id(), session_id=session.id)
from kimi_cli.telemetry.sink import EventSink
from kimi_cli.telemetry.transport import AsyncTransport

def _get_token() -> str | None:
    return oauth.get_cached_access_token(KIMI_CODE_OAUTH_KEY)

transport = AsyncTransport(get_access_token=_get_token)
sink = EventSink(
    transport,
    version=VERSION,
    model=model.model if model else "",
    ui_mode=ui_mode,
)
attach_sink(sink)
```
🟡 Global telemetry state causes event loss and wrong session attribution in multi-session ACP mode
Telemetry uses module-level globals (_sink, _session_id) but KimiCLI.create() is called once per session in multi-session ACP mode (src/kimi_cli/acp/server.py:155-167 and src/kimi_cli/acp/server.py:223-237). Each call to KimiCLI.create() at src/kimi_cli/app.py:296 overwrites _session_id via set_context() and replaces the global _sink via attach_sink() at src/kimi_cli/app.py:310. The previous sink (with buffered events from the earlier session's track("started") and track("startup_perf")) is orphaned — its events are never flushed via HTTP or saved to disk. Additionally, any subsequent track() calls originating from an earlier session's code path will record the wrong session_id (the latest session's ID). Since no periodic flush is started outside the shell UI, and the atexit handler only flushes the latest sink, telemetry events from all but the last-created session are silently lost.
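One possible mitigation (the more complete fix would be per-session sinks, as the comment implies) is to drain the outgoing sink before it is orphaned. This is a hypothetical sketch, not the PR's code: `attach_sink_draining` is an invented name, and it assumes a sink exposing the PR's `flush_sync`.

```python
_active_sink = None


def attach_sink_draining(new_sink) -> None:
    """Replace the global sink, flushing the old one's buffer first."""
    global _active_sink
    old, _active_sink = _active_sink, new_sink
    if old is not None:
        old.flush_sync()  # drain buffered events before the reference is dropped
```

This prevents the buffered `started`/`startup_perf` events of an earlier session from being silently lost, though it does not fix the wrong-session-attribution problem for late `track()` calls.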
… in AsyncTransport
```python
from kimi_cli.telemetry import track

error_type, status_code = classify_api_error(e)
if status_code is not None:
    track("api_error", error_type=error_type, status_code=status_code)
else:
    track("api_error", error_type=error_type)
```
🟡 Unguarded telemetry in exception handler can mask original error and skip StopFailure hook
In the _agent_loop exception handler, classify_api_error(e) and track() are called without a try/except guard. If either call raises an unexpected exception, the original API/step error e is replaced by the new exception: the raise at line 826 is never reached, the StopFailure hook at lines 813-825 never fires, and the caller (e.g., run_soul_command) catches the wrong exception type — showing a confusing error instead of the real API error message.
The hook engine at src/kimi_cli/hooks/engine.py:244-254 explicitly wraps its telemetry call in try/except with a comment explaining the safety rationale. The same pattern should be applied here for consistency and correctness.
While classify_api_error only does safe operations on well-typed objects (making failures extremely unlikely), the code path is inside an exception handler where correctness of re-raise semantics is critical.
# Conflicts:
#   src/kimi_cli/app.py
```python
except Exception:
    # Unexpected error — leave file for next startup
    logger.debug("Unexpected error retrying {path}, will try again later", path=path)
```
🟡 retry_disk_events does not handle TypeError from _build_payload, causing stuck retry files
send() at src/kimi_cli/telemetry/transport.py:161-173 explicitly catches TypeError from _build_payload() and drops the events, with a comment explaining: "Retrying would hit the same TypeError on every reload, so falling back to disk would just create a permanently stuck file."
However, retry_disk_events() calls _build_payload() at line 297 without the same TypeError guard. If a disk file happens to contain events with non-primitive values in properties/context, the TypeError falls through to the generic except Exception: at line 312, which leaves the file for next startup. This creates exactly the "permanently stuck file" scenario that send() was designed to prevent — the file gets retried on every startup, hitting the same TypeError, until the 7-day expiry cleans it up.
Inconsistent error handling between send() and retry_disk_events()
In send() (line 161-173): TypeError → drop events immediately.
In retry_disk_events() (line 297, caught at 312): TypeError → leave file on disk for perpetual retry.
Suggested change:

```diff
-except Exception:
-    # Unexpected error — leave file for next startup
-    logger.debug("Unexpected error retrying {path}, will try again later", path=path)
+except TypeError:
+    # Schema violation (non-primitive in properties/context).
+    # Same as send(): retrying would hit the same error, so delete.
+    logger.debug("Removing telemetry file with schema violation: {path}", path=path)
+    path.unlink(missing_ok=True)
+except Exception:
+    # Unexpected error — leave file for next startup
+    logger.debug("Unexpected error retrying {path}, will try again later", path=path)
```